Method and device for detecting voice activity

ABSTRACT

The invention concerns a method for detecting voice activity in a digital speech signal, in at least one frequency band, for example by means of a detection automaton whose state is controlled on the basis of an energy analysis of the signal. The control of said automaton, or more generally the determination of voice activity, comprises a comparison, in the frequency band, of two different versions of the speech signal, at least one of which is a noise-corrected version.

BACKGROUND OF THE INVENTION

The present invention relates to digital techniques for processing speech signals. It relates more particularly to the techniques utilizing voice activity detection so as to perform different processing depending on whether the signal does or does not carry voice activity.

The digital techniques in question come under varied domains: coding of speech for transmission or storage, speech recognition, noise reduction, echo cancellation, etc.

The main difficulty with processes for detecting voice activity is that of distinguishing between voice activity and the noise which accompanies the speech signal.

The document WO99/14737 describes a method of detecting voice activity in a digital speech signal processed on the basis of successive frames and in which an a priori denoising of the speech signal of each frame is carried out on the basis of noise estimates obtained during the processing of one or more previous frames, and the variations in the energy of the a priori denoised signal are analyzed so as to detect a degree of voice activity of the frame. By carrying out the detection of voice activity on the basis of an a priori denoised signal, the performance of this detection is substantially improved when the surrounding noise is relatively strong.

In the methods customarily used to detect voice activity, the energy variations of the (direct or denoised) signal are analyzed with respect to a long-term average of the energy of this signal, a relative increase in the instantaneous energy suggesting the appearance of voice activity.

An aim of the present invention is to propose another type of analysis allowing voice activity detection which is robust to the noise which may accompany the speech signal.

SUMMARY OF THE INVENTION

According to the invention, there is proposed a method for detecting voice activity in a digital speech signal in at least one frequency band, whereby the voice activity is detected on the basis of an analysis comprising a comparison, in the said frequency band, of two different versions of the speech signal, at least one of which is a denoised version obtained by taking account of estimates of the noise included in the signal.

This method can be executed over the entire frequency band of the signal, or on a subband basis, as a function of the requirements of the application using voice activity detection.

Voice activity can be detected in a binary manner for each band, or measured by a continuously varying parameter which may result from the comparison between the two different versions of the speech signal.

The comparison typically pertains to respective energies, evaluated in the said frequency band, of the two different versions of the speech signal, or to a monotonic function of these energies.

Another aspect of the present invention relates to a device for detecting voice activity in a speech signal, comprising signal processing means designed to implement a method as defined hereinabove.

The invention further relates to a computer program, loadable into a memory associated with a processor, and comprising portions of code for implementing a method as defined hereinabove upon the execution of the said program by the processor, as well as to a computer medium on which such a program is recorded.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a signal processing chain using a voice activity detector according to the invention;

FIG. 2 is a schematic diagram of an exemplary voice activity detector according to the invention;

FIGS. 3 and 4 are flow charts of signal processing operations performed in the detector of FIG. 2;

FIG. 5 is a graphic showing an exemplary profile of energies calculated in the detector of FIG. 2 and illustrating the principle of voice activity detection;

FIG. 6 is a diagram of a detection automaton implemented in the detector of FIG. 2;

FIG. 7 is a schematic diagram of another embodiment of a voice activity detector according to the invention;

FIG. 8 is a flow chart of signal processing operations performed in the detector of FIG. 7;

FIG. 9 is a graphic of a function used in the operations of FIG. 8.

DETAILED DESCRIPTION

The device of FIG. 1 processes a digital speech signal s. The signal processing chain represented produces voice activity decisions δ_(n,j) which are usable in a manner known per se by application units, not represented, affording functions such as speech coding, speech recognition, noise reduction, echo cancellation, etc. The decisions δ_(n,j) can comprise a frequency resolution (index j), this making it possible to enhance applications operating in the frequency domain.

A windowing module 10 puts the signal s into the form of successive windows or frames of index n, each consisting of a number N of samples of digital signal. In a conventional manner, these frames may exhibit mutual overlaps. In the remainder of the present description, the frames will be regarded, without this being in any way limiting, as consisting of N=256 samples at a sampling frequency F_(e) of 8 kHz, with a Hamming weighting in each window, and overlaps of 50% between consecutive windows.
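
As an illustration of this windowing stage, the following minimal floating-point sketch (not the fixed-point code of the appendices; the function name and arguments are hypothetical) extracts frame n with 50% overlap and applies the Hamming weighting:

    #include <math.h>

    #define N   256              /* frame length (samples)     */
    #define HOP (N / 2)          /* 50% overlap between frames */

    /* Extract frame n from the signal s (module 10): copy N samples
       starting at n*HOP and weight each of them by a Hamming window. */
    static void window_frame(const float *s, int n, float frame[N])
    {
        const float pi = 3.14159265f;
        for (int k = 0; k < N; k++) {
            float w = 0.54f - 0.46f * cosf(2.0f * pi * (float)k / (float)(N - 1));
            frame[k] = w * s[n * HOP + k];
        }
    }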

The signal frame is transformed into the frequency domain by a module 11 applying a conventional fast Fourier transform algorithm (FFT) for calculating the modulus of the spectrum of the signal. The module 11 then delivers a set of N=256 frequency components of the speech signal, which are denoted S_(n,f), where n designates the current frame number, and f a frequency of the discrete spectrum. Owing to the properties of digital signals in the frequency domain, only the first N/2=128 samples are used.

To calculate the estimates of the noise contained in the signal s, we do not use the frequency resolution available at the output of the fast Fourier transform, but a lower resolution, determined by a number I of frequency subbands covering the [0, F_(e)/2] band of the signal. Each subband i (1≦i≦I) extends between a lower frequency f(i−1) and an upper frequency f(i), with f(0)=0 and f(I)=F_(e)/2. This chopping into subbands can be uniform (f(i)−f(i−1)=F_(e)/2I). It may also be non-uniform (for example according to a Bark scale). A module 12 calculates the respective averages of the spectral components S_(n,f) of the speech signal on a subband basis, for example through a uniform weighting such as:

$S_{n,i} = \frac{1}{f(i) - f(i-1)} \sum_{f \in [f(i-1),\, f(i)]} S_{n,f}$
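
The averaging performed by the module 12 can be sketched as follows, assuming hypothetical names and an array f_bound[0..I] holding the subband boundaries f(0), ..., f(I) expressed as FFT bin indices:

    /* Average the spectral moduli S_nf[f] over each of the I subbands
       (uniform weighting within each subband), as in the formula above. */
    void subband_average(const float *S_nf, const int *f_bound, int I, float *S_ni)
    {
        for (int i = 1; i <= I; i++) {
            float sum = 0.0f;
            for (int f = f_bound[i - 1]; f < f_bound[i]; f++)
                sum += S_nf[f];
            S_ni[i - 1] = sum / (float)(f_bound[i] - f_bound[i - 1]);
        }
    }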

This averaging reduces the fluctuations between the subbands by averaging the contributions of the noise in these subbands, and this will reduce the variance of the noise estimator. Furthermore, this averaging makes it possible to reduce the complexity of the system.

The averaged spectral components S_(n,i) are addressed to a voice activity detection module 15 and to a noise estimation module 16. B̂_(n,i) denotes the long-term estimate of the noise component produced by the module 16 in relation to frame n and to subband i.

These long-term estimates B̂_(n,i) may for example be obtained in the manner described in WO99/14737. It is also possible to use simple smoothing by means of an exponential window defined by a forget factor λ_(B):

B̂_(n,i) = λ_(B).B̂_(n−1,i) + (1−λ_(B)).S_(n,i)

with λ_(B) equal to 1 if the voice activity detector 15 indicates that subband i bears voice activity, and equal to a value lying between 0 and 1 otherwise.
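
A possible rendering of this exponential smoothing is sketched below; the names are hypothetical, and lambda_B_noise stands for the value of λ_(B) used when no voice activity is reported:

    /* Update the long-term noise estimate B_hat[i] for the current frame
       (module 16).  vad_subband[i] != 0 means the detector 15 reports voice
       activity in subband i, in which case the estimate is frozen. */
    void update_noise_estimate(float *B_hat, const float *S_ni,
                               const int *vad_subband, int I,
                               float lambda_B_noise /* between 0 and 1 */)
    {
        for (int i = 0; i < I; i++) {
            float lambda_B = vad_subband[i] ? 1.0f : lambda_B_noise;
            B_hat[i] = lambda_B * B_hat[i] + (1.0f - lambda_B) * S_ni[i];
        }
    }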

Of course, it is possible to use other long-term estimates representative of the noise component included in the speech signal; these estimates may represent a long-term average, or else a minimum of the component S_(n,i) over a sufficiently long sliding window.

FIGS. 2 to 6 illustrate a first embodiment of the voice activity detector 15. A denoising module 18 executes, for each frame n and each subband i, the operations corresponding to steps 180 to 187 of FIG. 3, so as to produce two denoised versions Êp_(1,n,i), Êp_(2,n,i) of the speech signal. This denoising is done by non-linear spectral subtraction. The first version Êp_(1,n,i) is denoised in such a way as not to be less, in the spectral domain, than a fraction β1_(i) of the long-term estimate B̂_(n−τ1,i). The second version Êp_(2,n,i) is denoised in such a way as not to be less, in the spectral domain, than a fraction β2_(i) of the long-term estimate B̂_(n−τ1,i). The quantity τ1 is a delay expressed as a number of frames, which may be fixed (for example τ1=1) or variable. The more confident one is in the voice activity detection, the smaller the delay will be. The fractions β1_(i) and β2_(i) (such that β1_(i) > β2_(i)) may be dependent on or independent of subband i. Preferred values correspond for β1_(i) to an attenuation of 10 dB, and for β2_(i) to an attenuation of 60 dB, i.e. β1_(i)≈0.3 and β2_(i)≈0.001.

In step 180, the module 18 calculates, with the resolution of the subbands i, the frequency response Hp_(n,i) of the a priori denoising filter, according to:

$Hp_{n,i} = \frac{S_{n,i} - \alpha'_{n-\tau 1,i} \cdot \hat{B}_{n-\tau 1,i}}{S_{n-\tau 2,i}}$

where τ2 is a positive or zero integer delay and α′_(n,i) is a noise overestimation coefficient. This overestimation coefficient α′_(n,i) may be dependent on or independent of the frame index n and/or the subband index i. In a preferred embodiment, it depends both on n and i, and it is determined as described in document WO99/14737. A first denoising is performed in step 181: Êp_(n,i) = Hp_(n,i).S_(n,i). In steps 182 to 184, the spectral components Êp_(1,n,i) are calculated according to Êp_(1,n,i) = max(Êp_(n,i); β1_(i).B̂_(n−τ1,i)), and in steps 185 to 187, the spectral components Êp_(2,n,i) are calculated according to Êp_(2,n,i) = max(Êp_(n,i); β2_(i).B̂_(n−τ1,i)).
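
Steps 180 to 187 can be sketched as follows for one frame, in floating point and with hypothetical names; S_prev stands for S_(n−τ2,i), B_prev for B̂_(n−τ1,i) and alpha for α′_(n−τ1,i):

    #include <math.h>   /* fmaxf */

    /* Non-linear spectral subtraction (steps 180 to 187) producing the two
       denoised versions Ep1[i] and Ep2[i], floored respectively at
       beta1*B_prev[i] and beta2*B_prev[i]. */
    void denoise_two_versions(const float *S, const float *S_prev,
                              const float *B_prev, const float *alpha, int I,
                              float beta1 /* ~0.3 */, float beta2 /* ~0.001 */,
                              float *Ep1, float *Ep2)
    {
        for (int i = 0; i < I; i++) {
            float Hp = (S[i] - alpha[i] * B_prev[i]) / S_prev[i];  /* step 180 */
            float Ep = Hp * S[i];                                  /* step 181 */
            Ep1[i] = fmaxf(Ep, beta1 * B_prev[i]);                 /* steps 182-184 */
            Ep2[i] = fmaxf(Ep, beta2 * B_prev[i]);                 /* steps 185-187 */
        }
    }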

The voice activity detector 15 of FIG. 2 comprises a module 19 which calculates energies of the denoised versions of the signal Êp_(1,n,i) and Êp_(2,n,i) respectively lying in m frequency bands designated by the index j (1≦j≦m, m≧1). This resolution may be the same as that of the subbands defined by the module 12 (index i), or a coarser resolution, possibly extending to the whole of the useful band [0, F_(e)/2] of the signal (case m=1). By way of example, the module 12 can define I=16 uniform subbands of the band [0, F_(e)/2], and the module 19 can retain m=3 wider bands, each band of index j covering the subbands of index i ranging from imin(j) to imax(j), with imin(1)=1, imin(j+1)=imax(j)+1 for 1≦j<m, and imax(m)=I. In step 190 (FIG. 3), the module 19 calculates the energies per band:

$E_{1,n,j} = \sum_{i = i\min(j)}^{i\max(j)} \left[ f(i) - f(i-1) \right] \cdot \hat{E}p_{1,n,i}^{2}$

$E_{2,n,j} = \sum_{i = i\min(j)}^{i\max(j)} \left[ f(i) - f(i-1) \right] \cdot \hat{E}p_{2,n,i}^{2}$
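
Step 190 then amounts to the following accumulation (hypothetical names; imin and imax are given as 0-based subband indices and df[i] = f(i)−f(i−1)):

    /* Energy of each denoised version over band j (step 190). */
    void band_energies(const float *Ep1, const float *Ep2, const float *df,
                       const int *imin, const int *imax, int m,
                       float *E1, float *E2)
    {
        for (int j = 0; j < m; j++) {
            E1[j] = 0.0f;
            E2[j] = 0.0f;
            for (int i = imin[j]; i <= imax[j]; i++) {
                E1[j] += df[i] * Ep1[i] * Ep1[i];
                E2[j] += df[i] * Ep2[i] * Ep2[i];
            }
        }
    }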

A module 20 of the voice activity detector 15 performs a temporal smoothing of the energies E_(1,n,j) and E_(2,n,j) for each of the bands of index j, this corresponding to steps 200 to 205 of FIG. 4. The smoothing of these two energies is performed by means of a smoothing window determined by comparing the energy E_(2,n,j) of the most denoised version with its previously calculated smoothed energy Ē_(2,n−1,j), or with a value of the order of this smoothed energy Ē_(2,n−1,j) (tests 200 and 201). This smoothing window can be an exponential window defined by a forget factor λ lying between 0 and 1. This forget factor λ can take three values: the first, λ_(r), very close to 0 (for example λ_(r)=0), chosen in step 202 if E_(2,n,j)≦Ē_(2,n−1,j); the second, λ_(q), very close to 1 (for example λ_(q)=0.99999), chosen in step 203 if E_(2,n,j)>ΔĒ_(2,n−1,j), Δ being a coefficient bigger than 1; and the third, λ_(p), lying between 0 and λ_(q) (for example λ_(p)=0.98), chosen in step 204 if Ē_(2,n−1,j)<E_(2,n,j)≦ΔĒ_(2,n−1,j). The exponential smoothing with the forget factor λ is then performed conventionally in step 205 according to:

Ē_(1,n,j) = λ.Ē_(1,n−1,j) + (1−λ).E_(1,n,j)
Ē_(2,n,j) = λ.Ē_(2,n−1,j) + (1−λ).E_(2,n,j)
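
The choice of the forget factor and the smoothing itself (steps 200 to 205) may be sketched as follows for one band j, using the numerical values quoted later for the example of the appendices; the function name is hypothetical:

    /* Temporal smoothing of E1 and E2 for one band (module 20).  E1s and E2s
       hold the smoothed energies of the previous frame on entry and are
       updated in place. */
    void smooth_energies(float E1, float E2, float *E1s, float *E2s,
                         float delta /* > 1, e.g. 4.953 */)
    {
        float lambda;
        if (E2 <= *E2s)                 /* step 202: energy decrease   */
            lambda = 0.0f;              /* lambda_r                    */
        else if (E2 > delta * *E2s)     /* step 203: abrupt increase   */
            lambda = 0.99999f;          /* lambda_q                    */
        else                            /* step 204: moderate increase */
            lambda = 0.98f;             /* lambda_p                    */

        *E1s = lambda * *E1s + (1.0f - lambda) * E1;    /* step 205 */
        *E2s = lambda * *E2s + (1.0f - lambda) * E2;
    }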

An exemplary variation over time of the energies E_(1,n,j) and E_(2,n,j) and of the smoothed energies Ē_(1,n,j) and Ē_(2,n,j) is represented in FIG. 5. It may be seen that good tracking of the smoothed energies is achieved when the forget factor is determined on the basis of the variations in the energy E_(2,n,j) corresponding to the most denoised version of the signal. The forget factor λ_(p) makes it possible to take into account the increases in the level of the background noise, the energy reductions being tracked by the forget factor λ_(r). The forget factor λ_(q) very close to 1 means that the smoothed energies do not track the abrupt energy increases due to speech. However, the factor λ_(q) remains slightly less than 1 so as to avoid errors caused by an increase in the background noise which may arise during a fairly long period of speech.

The voice activity detection automaton is controlled in particular by a parameter resulting from a comparison of the energies E_(1,n,j) and E_(2,n,j). This parameter can in particular be the ratio d_(n,j)=E_(1,n,j)/E_(2,n,j). It may be seen in FIG. 5 that this ratio d_(n,j) allows proper detection of the speech phases (represented by hatching).

The control of the detection automaton can also use other parameters, such as a parameter related to the signal-to-noise ratio: snr_(n,j)=E_(1,n,j)/Ē_(1,n,j), this amounting to taking into account a comparison between the energies E_(1,n,j) and Ē_(1,n,j). The module 21 for controlling the automata relating to the various bands of index j calculates the parameters d_(n,j) and snr_(n,j) in step 210, then determines the state of the automata. The new state δ_(n,j) of the automaton relating to band j depends on the previous state δ_(n−1,j), on d_(n,j) and on snr_(n,j), for example as indicated in the diagram of FIG. 6.

Four states are possible: δ_(j)=0 denotes silence, or absence of speech; δ_(j)=2 denotes the presence of voice activity; and the states δ_(j)=1 and δ_(j)=3 are intermediate states of ascent and descent. When the automaton is in the silence state (δ_(n−1,j)=0), it remains there if d_(n,j) exceeds a first threshold α1_(j), and it switches to the ascent state in the converse case. In the ascent state (δ_(n−1,j)=1), it returns to the silence state if d_(n,j) exceeds a second threshold α2_(j), and it switches to the speech state in the converse case. When the automaton is in the speech state (δ_(n−1,j)=2), it remains there if snr_(n,j) exceeds a third threshold α3_(j), and it switches to the descent state in the converse case. In the descent state (δ_(n−1,j)=3), the automaton returns to the speech state if snr_(n,j) exceeds a fourth threshold α4_(j), and it returns to the silence state in the converse case. The thresholds α1_(j), α2_(j), α3_(j) and α4_(j) may be optimized separately for each of the frequency bands j.
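
The transitions of FIG. 6 can be summarized by the sketch below (one automaton per band j; the state names correspond to δ_(j)=0 to 3, and the thresholds a1 to a4 to α1_(j) to α4_(j), e.g. α1_(j)=α2_(j)=α4_(j)=1.221 and α3_(j)=1.649 as in the example of the appendices):

    enum vad_state { SILENCE = 0, ASCENT = 1, SPEECH = 2, DESCENT = 3 };

    /* One step of the detection automaton of FIG. 6 for band j (module 21). */
    enum vad_state automaton_step(enum vad_state prev, float d, float snr,
                                  float a1, float a2, float a3, float a4)
    {
        switch (prev) {
        case SILENCE: return (d   > a1) ? SILENCE : ASCENT;
        case ASCENT:  return (d   > a2) ? SILENCE : SPEECH;
        case SPEECH:  return (snr > a3) ? SPEECH  : DESCENT;
        case DESCENT: return (snr > a4) ? SPEECH  : SILENCE;
        }
        return SILENCE;
    }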

It is also possible for the automata relating to the various bands to be made to interact by the module 21.

In particular, it may force each of the automata relating to each of the subbands to the speech state as soon as one among them is in the speech state. In this case, the output of the voice activity detector 15 relates to the whole of the signal band.

The two appendices to the present description show a source code in the C++ language, with a fixed-point data representation, corresponding to an implementation of the exemplary voice activity detection method described hereinabove. To embody the detector, one possibility is to translate this source code into executable code, to record it in a program memory associated with an appropriate signal processor, and to have it executed by this processor on the input signals of the detector. The function a_priori_signal_power presented in appendix 1 corresponds to the operations incumbent on the modules 18 and 19 of the voice activity detector 15 of FIG. 2. The function voice_activity_detector presented in appendix 2 corresponds to the operations incumbent on the modules 20 and 21 of this detector.

In the particular example of the appendices, the following parameters have been employed: τ1=1; τ2=0; β1_(i)=0.3; β2_(i)=0.001; m=3; Δ=4.953; λ_(p)=0.98; λ_(q)=0.99999; λ_(r)=0; α1_(j)=α2_(j)=α4_(j)=1.221; α3_(j)=1.649. Table I hereinbelow gives the correspondences between the notation employed in the above description and in the drawings and that employed in the appendix.

TABLE I

  Notation in appendix       Notation in description
  subband                    i
  E[subband]                 S_(n,i)
  module                     Êp_(n,i), Êp_(1,n,i) or Êp_(2,n,i)
  param.beta_a_priori1       β1_(i)
  param.beta_a_priori2       β2_(i)
  vad                        j−1
  param.vad_number           m
  P1[vad]                    E_(1,n,j)
  P1s[vad]                   Ē_(1,n,j)
  P2[vad]                    E_(2,n,j)
  P2s[vad]                   Ē_(2,n,j)
  DELTA_P                    Log(Δ)
  d                          Log(d_(n,j))
  snr                        Log(snr_(n,j))
  NOISE                      silence state
  ASCENT                     ascent state
  SIGNAL                     speech state
  DESCENT                    descent state
  D_NOISE                    Log(α1_(j))
  D_SIGNAL                   Log(α2_(j))
  SNR_SIGNAL                 Log(α3_(j))
  SNR_NOISE                  Log(α4_(j))

In the variant embodiment illustrated by FIG. 7, the denoising module 25 of the voice activity detector 15 delivers a single denoised version Êp_(n,i) of the speech signal, so that the module 26 calculates its energy E_(2,n,j) for each band j. The other version, of which the module 26 also calculates the energy, is represented directly by the non-denoised samples S_(n,i).

As before, various denoising processes may be applied by the module 25. In the example illustrated by steps 250 to 256 of FIG. 8, the denoising is done by nonlinear spectral subtraction with a noise overestimation coefficient dependent on a quantity ρ related to the signal-to-noise ratio. In steps 250 to 252, a preliminary denoising is performed for each subband of index i according to:

S′_(n,i) = max(S_(n,i) − α.B̂_(n−1,i); β.B̂_(n−1,i))

the preliminary overestimation coefficient being for example α=2, and the fraction β possibly corresponding to a noise attenuation of the order of 10 dB.

The quantity ρ is taken equal to the ratio S′_(n,i)/S_(n,i) in step 253. The overestimation factor f(ρ) varies in a nonlinear manner with the quantity ρ, for example as represented in FIG. 9. For the values of ρ closest to 0 (ρ<ρ₁), the signal-to-noise ratio is low, and it is possible to take an overestimation factor f(ρ)=2. For the highest values of ρ (ρ₂≦ρ≦1), the noise is weak and need not be overestimated (f(ρ)=1). Between ρ₁ and ρ₂, f(ρ) decreases from 2 to 1, for example linearly. The denoising proper, providing the version Êp_(n,i), is performed in steps 254 to 256:

Êp_(n,i) = max(S_(n,i) − f(ρ).B̂_(n−1,i); β.B̂_(n−1,i))
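
Steps 250 to 256 may be sketched as follows for one subband, in floating point and with hypothetical names; rho1 and rho2 are the abscissae ρ₁ and ρ₂ of FIG. 9:

    #include <math.h>   /* fmaxf */

    /* Variant denoising of the module 25 for one subband: preliminary
       subtraction with alpha = 2 (steps 250 to 252), then subtraction with
       an overestimation factor f(rho) that decreases linearly from 2 to 1
       between rho1 and rho2 (steps 253 to 256). */
    float denoise_one_subband(float S, float B_prev, float beta,
                              float rho1, float rho2)
    {
        float S_prime = fmaxf(S - 2.0f * B_prev, beta * B_prev);
        float rho = S_prime / S;
        float f;
        if (rho < rho1)
            f = 2.0f;
        else if (rho >= rho2)
            f = 1.0f;
        else
            f = 2.0f - (rho - rho1) / (rho2 - rho1);
        return fmaxf(S - f * B_prev, beta * B_prev);
    }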

The voice activity detector 15 considered with reference to FIG. 7 uses, in each frequency band of index j (and/or in full band), a detection automaton having two states, silence or speech. The energies E_(1,n,j) and E_(2,n,j) calculated by the module 26 are respectively those contained in the components S_(n,i) of the speech signal and those contained in the denoised components Êp_(n,i), calculated over the various bands as indicated in step 260 of FIG. 8. The comparison of the two different versions of the speech signal pertains to respective differences between the energies E_(1,n,j) and E_(2,n,j) and a lower bound of the energy E_(2,n,j) of the denoised version.

This lower bound E_(2min,j) can in particular correspond to a minimum value, over a sliding window, of the energy E_(2,n,j) of the denoised version of the speech signal in the frequency band considered. In this case, a module 27 stores in a memory of the first-in first-out type (FIFO) the L most recent values of the energy E_(2,n,j) of the denoised signal in each band j, over a sliding window representing for example of the order of 20 frames, and delivers the minimum energies

$E_{2\min,j} = \min_{0 \leq k < L} E_{2,n-k,j}$

over this window (step 270 of FIG. 8). In each band, this minimum energy E_(2min,j) serves as lower bound for the module 28 for controlling the detection automaton, which uses a measure M_(j) given by

$M_{j} = \frac{E_{1,n,j} - E_{2\min,j}}{E_{2,n,j} - E_{2\min,j}} \quad (\text{step } 280)$

The automaton can be a simple binary automaton using a threshold A_(j), possibly dependent on the band considered: if M_(j)≧A_(j), the output bit δ_(n,j) of the detector represents a silence state of the band j, and if M_(j)<A_(j), it represents a speech state. As a variant, the module 28 could deliver a nonbinary measure of the voice activity, represented by a decreasing function of M_(j).
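
Steps 270 and 280 may be sketched as follows for one band j, in floating point and with hypothetical names; E2_hist[] is the FIFO of the L most recent values of E_(2,n,j) maintained by the module 27, and the measure of step 280 compares the differences E_(1,n,j)−E_(2min,j) and E_(2,n,j)−E_(2min,j):

    /* Two-state decision for one band j (modules 27 and 28).
       Returns 1 for the speech state and 0 for the silence state. */
    int detect_band(float E1, float E2, const float *E2_hist, int L, float A)
    {
        float E2min = E2_hist[0];                 /* step 270: minimum over */
        for (int k = 1; k < L; k++)               /* the sliding window     */
            if (E2_hist[k] < E2min)
                E2min = E2_hist[k];

        float denom = E2 - E2min;
        if (denom <= 0.0f)                        /* degenerate case:       */
            return 0;                             /* treat as silence       */
        float M = (E1 - E2min) / denom;           /* step 280 */
        return (M >= A) ? 0 : 1;                  /* M >= A_j means silence */
    }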

As a variant, the lower bound E_(2min,j) used in step 280 could be calculated with the aid of an exponential window, with a forget factor. It could also be represented by the energy over band j of the quantity β.B̂_(n−1,i) serving as floor in the denoising by spectral subtraction.

In the foregoing, the analysis performed in order to decide on the presence or absence of voice activity pertains directly to energies of different versions of the speech signal. Of course, the comparisons could pertain to a monotonic function of these energies, for example a logarithm, or to a quantity having similar behavior to the energies according to voice activity (for example the power).

APPENDIX 1

/*************************************************************************
 * description
 * -----------
 * NSS module:
 * signal power before VAD
 *************************************************************************/

/*-----------------------------------------------------------------------*
 * included files
 *-----------------------------------------------------------------------*/
#include <assert.h>
#include "private.h"

/*-----------------------------------------------------------------------*
 * private
 *-----------------------------------------------------------------------*/
Word32 power(Word16 module, Word16 beta, Word16 thd, Word16 val);

/*-----------------------------------------------------------------------*
 * a_priori_signal_power
 *-----------------------------------------------------------------------*/
void a_priori_signal_power (
    /* IN */
    Word16 *E,
    Word16 *internal_state,
    Word16 *max_noise,
    Word16 *long_term_noise,
    Word16 *frequential_scale,
    /* IN&OUT */
    Word16 *alpha,
    /* OUT */
    Word32 *P1,
    Word32 *P2
)
{
    int vad;
    for(vad = 0; vad < param.vad_number; vad++) {
        int start = param.vads[vad].first_subband_for_power;
        int stop = param.vads[vad].last_subband;
        int subband;
        int uniform_subband;

        uniform_subband = 1;
        for(subband = start; subband <= stop; subband++)
            if(param.subband_size[subband] != param.subband_size[start])
                uniform_subband = 0;

        P1[vad] = 0; move32();
        P2[vad] = 0; move32();

        test();
        if(sub(internal_state[vad], NOISE) == 0) {
            for(subband = start; subband <= stop; subband++) {
                Word32 pwr;
                Word16 shift;
                Word16 module;
                Word16 alpha_long_term;

                alpha_long_term = shr(max_noise[subband], 2); move16();
                test(); test();
                if(sub(alpha_long_term, long_term_noise[subband]) >= 0) {
                    alpha[subband] = 0x7fff; move16();
                    alpha_long_term = long_term_noise[subband]; move16();
                }
                else if(sub(max_noise[subband], long_term_noise[subband]) < 0) {
                    alpha[subband] = 0x2000; move16();
                    alpha_long_term = shr(long_term_noise[subband], 2); move16();
                }
                else {
                    alpha[subband] = div_s(alpha_long_term, long_term_noise[subband]); move16();
                }
                module = sub(E[subband], shl(alpha_long_term, 2)); move16();
                if(uniform_subband) {
                    shift = shl(frequential_scale[subband], 1); move16();
                }
                else {
                    shift = add(param.subband_shift[subband], shl(frequential_scale[subband], 1)); move16();
                }
                pwr = power(module, param.beta_a_priori1, long_term_noise[subband], long_term_noise[subband]);
                pwr = L_shr(pwr, shift);
                P1[vad] = L_add(P1[vad], pwr); move32();
                pwr = power(module, param.beta_a_priori2, long_term_noise[subband], long_term_noise[subband]);
                pwr = L_shr(pwr, shift);
                P2[vad] = L_add(P2[vad], pwr); move32();
            }
        }
        else {
            for(subband = start; subband <= stop; subband++) {
                Word32 pwr;
                Word16 shift;
                Word16 module;
                Word16 alpha_long_term;

                alpha_long_term = mult(alpha[subband], long_term_noise[subband]); move16();
                module = sub(E[subband], shl(alpha_long_term, 2)); move16();
                if(uniform_subband) {
                    shift = shl(frequential_scale[subband], 1); move16();
                }
                else {
                    shift = add(param.subband_shift[subband], shl(frequential_scale[subband], 1)); move16();
                }
                pwr = power(module, param.beta_a_priori1, long_term_noise[subband], E[subband]);
                pwr = L_shr(pwr, shift);
                P1[vad] = L_add(P1[vad], pwr); move32();
                pwr = power(module, param.beta_a_priori2, long_term_noise[subband], E[subband]);
                pwr = L_shr(pwr, shift);
                P2[vad] = L_add(P2[vad], pwr); move32();
            }
        }
    }
}

/*-----------------------------------------------------------------------*
 * power
 *-----------------------------------------------------------------------*/
Word32 power(Word16 module, Word16 beta, Word16 thd, Word16 val)
{
    Word32 power;
    test();
    if(sub(module, mult(beta, thd)) <= 0) {
        Word16 hi, lo;
        power = L_mult(val, val); move32();
        L_Extract(power, &hi, &lo);
        power = Mpy_32_16(hi, lo, beta); move32();
        L_Extract(power, &hi, &lo);
        power = Mpy_32_16(hi, lo, beta); move32();
    }
    else {
        power = L_mult(module, module); move32();
    }
    return(power);
}

APPENDIX 2

/*************************************************************************
 * description
 * -----------
 * NSS module:
 * VAD
 *************************************************************************/

/*-----------------------------------------------------------------------*
 * included files
 *-----------------------------------------------------------------------*/
#include <assert.h>
#include "private.h"
#include "simutool.h"

/*-----------------------------------------------------------------------*
 * private
 *-----------------------------------------------------------------------*/
#define DELTA_P    (1.6 * 1024)
#define D_NOISE    (.2 * 1024)
#define D_SIGNAL   (.2 * 1024)
#define SNR_SIGNAL (.5 * 1024)
#define SNR_NOISE  (.2 * 1024)

/*-----------------------------------------------------------------------*
 * voice_activity_detector
 *-----------------------------------------------------------------------*/
void voice_activity_detector (
    /* IN */
    Word32 *P1,
    Word32 *P2,
    Word16 frame_counter,
    /* IN&OUT */
    Word32 *P1s,
    Word32 *P2s,
    Word16 *internal_state,
    /* OUT */
    Word16 *state
)
{
    int vad;
    int signal;
    int noise;

    signal = 0; move16();
    noise = 1; move16();
    for(vad = 0; vad < param.vad_number; vad++) {
        Word16 snr, d;
        Word16 logP1, logP1s;
        Word16 logP2, logP2s;

        logP2 = logfix(P2[vad]); move16();
        logP2s = logfix(P2s[vad]); move16();
        test();
        if(L_sub(P2[vad], P2s[vad]) > 0) {
            Word16 hi1, lo1;
            Word16 hi2, lo2;
            L_Extract(L_sub(P1[vad], P1s[vad]), &hi1, &lo1);
            L_Extract(L_sub(P2[vad], P2s[vad]), &hi2, &lo2);
            test();
            if(sub(sub(logP2, logP2s), DELTA_P) < 0) {
                P1s[vad] = L_add(P1s[vad], L_shr(Mpy_32_16(hi1, lo1, 0x6666), 4)); move32();
                P2s[vad] = L_add(P2s[vad], L_shr(Mpy_32_16(hi2, lo2, 0x6666), 4)); move32();
            }
            else {
                P1s[vad] = L_add(P1s[vad], L_shr(Mpy_32_16(hi1, lo1, 0x68db), 13)); move32();
                P2s[vad] = L_add(P2s[vad], L_shr(Mpy_32_16(hi2, lo2, 0x68db), 13)); move32();
            }
        }
        else {
            P1s[vad] = P1[vad]; move32();
            P2s[vad] = P2[vad]; move32();
        }
        logP1 = logfix(P1[vad]); move16();
        logP1s = logfix(P1s[vad]); move16();
        d = sub(logP1, logP2); move16();
        snr = sub(logP1, logP1s); move16();
        ProbeFix16("d", &d, 1, 1.);
        ProbeFix16("_snr", &snr, 1, 1.);
        {
            Word16 pp;
            ProbeFix16("p1", &logP1, 1, 1.);
            ProbeFix16("p2", &logP2, 1, 1.);
            ProbeFix16("p1s", &logP1s, 1, 1.);
            ProbeFix16("p2s", &logP2s, 1, 1.);
            pp = logP2 - logP2s;
            ProbeFix16("dp", &pp, 1, 1.);
        }
        test();
        if(sub(internal_state[vad], NOISE) == 0) goto LABEL_NOISE;
        test();
        if(sub(internal_state[vad], ASCENT) == 0) goto LABEL_ASCENT;
        test();
        if(sub(internal_state[vad], SIGNAL) == 0) goto LABEL_SIGNAL;
        test();
        if(sub(internal_state[vad], DESCENT) == 0) goto LABEL_DESCENT;

LABEL_NOISE:
        test();
        if(sub(d, D_NOISE) < 0) {
            internal_state[vad] = ASCENT; move16();
        }
        goto LABEL_END_VAD;

LABEL_ASCENT:
        test();
        if(sub(d, D_SIGNAL) < 0) {
            internal_state[vad] = SIGNAL; move16();
            signal = 1; move16();
            noise = 0; move16();
        }
        else {
            internal_state[vad] = NOISE; move16();
        }
        goto LABEL_END_VAD;

LABEL_SIGNAL:
        test();
        if(sub(snr, SNR_SIGNAL) < 0) {
            internal_state[vad] = DESCENT; move16();
        }
        else {
            signal = 1; move16();
        }
        noise = 0; move16();
        goto LABEL_END_VAD;

LABEL_DESCENT:
        test();
        if(sub(snr, SNR_NOISE) < 0) {
            internal_state[vad] = NOISE; move16();
        }
        else {
            internal_state[vad] = SIGNAL; move16();
            signal = 1; move16();
            noise = 0; move16();
        }
        goto LABEL_END_VAD;

LABEL_END_VAD:
        ;
    }

    *state = TRANSITION; move16();
    test(); test();
    if(signal != 0) {
        test();
        if(sub(frame_counter, param.init_frame_number) >= 0) {
            for(vad = 0; vad < param.vad_number; vad++) {
                internal_state[vad] = SIGNAL; move16();
            }
            *state = SIGNAL; move16();
        }
    }
    else if(noise != 0) {
        *state = NOISE; move16();
    }
}

CLAIMS

1. Method for detecting voice activity in a digital speech signal in at least one frequency band, wherein the voice activity is detected on the basis of an analysis comprising the step of comparing two different versions of the speech signal, wherein the two different versions of the speech signal are two versions denoised by non-linear spectral subtraction, wherein a first of the two versions is denoised in such a way as not to be less, in the spectral domain, than a first fraction of a long-term estimate representative of a noise component included in the speech signal, and the second of the two versions is denoised in such a way as not to be less, in the spectral domain, than a second fraction of said long-term estimate, smaller than said first fraction.

2. Method according to claim 1, wherein said comparison is performed on respective energies, evaluated in said frequency band, of the two different versions of the speech signal, or to a monotonic function of said energies.

3. Method according to claim 1, wherein said analysis further comprises a time smoothing of the energy of one of said versions of the speech signal, and a comparison between the energy of said version and the smoothed energy.

4. Method according to claim 3, wherein the comparison between the energy of said version and the smoothed energy controls transitions of a voice activity detection automaton from a speech state to a silence state, and wherein the comparison of the two different versions of the speech signal controls transitions of the detection automaton from the silence state to the speech state.

5. Method according to claim 1, wherein said analysis further comprises a time smoothing of the energy of each of the two versions of the speech signal, by means of a smoothing window determined by comparing the energy of the second of the two versions with the smoothed energy of the second of the two versions.

6. Method according to claim 5, wherein the smoothing window is an exponential window defined by a forgetting factor.

7. Method according to claim 6, comprising the step of allocating a substantially zero value to the forgetting factor when the energy of the second of the two versions is less than a value of the order of the smoothed energy of the second of the two versions.

8. Method according to claim 7, comprising the step of allocating a first value substantially equal to 1 to the forgetting factor when the energy of the second of the two versions is greater than said value of the order of the smoothed energy multiplied by a coefficient bigger than 1, and allocating a second value lying between 0 and said first value to the forgetting factor when the energy of the second of the two versions is greater than said value of the order of the smoothed energy and less than said value of the order of the smoothed energy multiplied by said coefficient.

9. Method according to claim 1, wherein the first and second fractions correspond substantially to attenuations of 10 dB and 60 dB, respectively.

10. Method according to claim 1, wherein the comparison of the two different versions of the speech signal is performed on respective differences between the energies of said two versions in said frequency band and a lower bound of the energy of the denoised version of the speech signal in said frequency band.
11. Device for detecting voice activity in a speech signal, comprising signal processing means for analyzing the speech signal in at least one frequency band, wherein the processing means comprise: first non-linear spectral subtraction means to provide a first version of the speech signal as a denoised version which is not less, in the spectral domain, than a first fraction of a long-term estimate representative of a noise component included in the speech signal; second non-linear spectral subtraction means to provide a second version of the speech signal as a denoised version which is not less, in the spectral domain, than a second fraction of said long-term estimate, said second fraction being smaller than said first fraction; and means for comparing the first and second versions of the speech signal.

12. Device according to claim 11, wherein the processing means comprise means for evaluating, in said frequency band, energies of said first and second versions of the speech signal, whereby inputs of the comparison means comprise said energies or a monotonic function of said energies.

13. Device according to claim 11, wherein the processing means further comprises means for performing a time smoothing of the energy of one of said first and second versions of the speech signal, and means for comparing the energy of said version and the smoothed energy.

14. Device according to claim 13, wherein the processing means comprise a voice activity detection automaton having a plurality of states including a speech state and a silence state, means for controlling transitions of the voice activity detection automaton from the speech state to the silence state based on a comparison between the energy of said one of said first and second versions and the smoothed energy, and means for controlling transitions of the voice activity detection automaton from the silence state to the speech state based on a comparison of the first and second versions of the speech signal.

15. Device according to claim 11, wherein the processing means further comprises means for performing a time smoothing of the energy of each of the first and second versions of the speech signal, by means of a smoothing window determined by comparing an energy of the second version with the smoothed energy of the second version.

16. Device according to claim 15, wherein the smoothing window is an exponential window defined by a forgetting factor.

17. Device according to claim 16, wherein the processing means further comprises means for allocating a substantially zero value to the forgetting factor when the energy of the second version is less than a value of the order of the smoothed energy of the second version.

18. Device according to claim 17, wherein the processing means further comprises means for allocating a first value substantially equal to 1 to the forgetting factor when the energy of the second version is greater than said value of the order of the smoothed energy multiplied by a coefficient bigger than 1, and for allocating a second value lying between 0 and said first value to the forgetting factor when the energy of the second version is greater than said value of the order of the smoothed energy and less than said value of the order of the smoothed energy multiplied by said coefficient.

19. Device according to claim 11, wherein the first and second fractions correspond substantially to attenuations of 10 dB and 60 dB, respectively.

20. Device according to claim 11, wherein the comparison of the first and second versions of the speech signal is performed on respective differences between the energies of said first and second versions in said frequency band and a lower bound of the energy of the denoised version of the speech signal in said frequency band.
21. A computer program product, loadable into a memory associated with a processor, and comprising portions of code for execution by the processor to detect voice activity in an input digital speech signal in at least one frequency band, whereby the voice activity is detected on the basis of an analysis comprising the step of comparing two different versions of the speech signal, wherein the two different versions of the speech signal are two versions denoised by non-linear spectral subtraction, wherein a first of the two versions is denoised in such a way as not to be less, in the spectral domain, than a first fraction of a long-term estimate representative of a noise component included in the speech signal, and the second of the two versions is denoised in such a way as not to be less, in the spectral domain, than a second fraction of said long-term estimate, smaller than said first fraction.

22. A computer program product according to claim 21, wherein said comparison is performed on respective energies, evaluated in said frequency band, of the two different versions of the speech signal, or to a monotonic function of said energies.

23. A computer program product according to claim 21, wherein said analysis further comprises a time smoothing of the energy of one of said versions of the speech signal, and a comparison between the energy of said version and the smoothed energy.

24. A computer program product according to claim 23, wherein the comparison between the energy of said version and the smoothed energy controls transitions of a voice activity detection automaton from a speech state to a silence state, and wherein the comparison of the two different versions of the speech signal controls transitions of the detection automaton from the silence state to the speech state.

25. A computer program product according to claim 21, wherein said analysis further comprises a time smoothing of the energy of each of the two versions of the speech signal, by means of a smoothing window determined by comparing the energy of the second of the two versions with the smoothed energy of the second of the two versions.

26. A computer program product according to claim 25, wherein the smoothing window is an exponential window defined by a forgetting factor.

27. A computer program product according to claim 26, wherein said analysis further comprises the step of allocating a substantially zero value to the forgetting factor when the energy of the second of the two versions is less than a value of the order of the smoothed energy of the second of the two versions.

28. A computer program product according to claim 27, wherein said analysis further comprises the steps of allocating a first value substantially equal to 1 to the forgetting factor when the energy of the second of the two versions is greater than said value of the order of the smoothed energy multiplied by a coefficient bigger than 1, and allocating a second value lying between 0 and said first value to the forgetting factor when the energy of the second of the two versions is greater than said value of the order of the smoothed energy and less than said value of the order of the smoothed energy multiplied by said coefficient.

29. A computer program product according to claim 21, wherein the first and second fractions correspond substantially to attenuations of 10 dB and 60 dB, respectively.

30. A computer program product according to claim 21, wherein the comparison of the two different versions of the speech signal is performed on respective differences between the energies of said two versions in said frequency band and a lower bound of the energy of the denoised version of the speech signal in said frequency band.